Transfer reinforcement learning (RL) methods leverage the experience collected on a set of source
tasks to speed up RL algorithms. A simple and effective approach is to transfer samples from source
tasks and include them in the training set used to solve a target task. In this paper, we investigate
the theoretical properties of this transfer method and we introduce novel algorithms adapting the
transfer process on the basis of the similarity between source and target tasks. Finally, we report
illustrative experimental results in a continuous chain problem.

1 Introduction 

The objective of transfer in reinforcement learning (RL) [12] is to speed up RL algorithms by reusing
knowledge (e.g., samples, value function, features, parameters) obtained from a set of source tasks.
The underlying assumption of transfer methods is that the source tasks (or a suitable combination
of these) are somehow similar to the target task, so that the transferred knowledge can be useful in
learning its solution. A wide range of scenarios and methods for transfer in RL have been studied
in the last decade (see [14, 9] for a thorough survey). In this paper, we focus on the simple transfer
approach where trajectory samples are transferred from source MDPs to increase the size of the
training set used to solve the target MDP. This approach is particularly suited to problems (e.g.,
robotics, applications involving human interaction) where it is not possible to interact with the
environment long enough to collect samples to solve the task at hand. If samples are available from other
sources (e.g., simulators in case of robotic applications), the solution of the target task can benefit
from a larger training set that also includes some source samples. This approach has already been
investigated in the case of transfer between tasks with different state-action spaces in [13], where the
source samples are used to build a model of the target task whenever the number of target samples is
not large enough. A more sophisticated sample-transfer method is proposed in [8]. The authors
introduce an algorithm which estimates the similarity between source and target tasks and selectively
transfers from the source tasks which are more likely to provide samples similar to those generated
by the target MDP. Although the empirical results are encouraging, the proposed method is based
on heuristic measures and no theoretical analysis of its performance is provided. On the other hand, 
in supervised learning a number of theoretical works investigated the effectiveness of transfer in 
reducing the sample complexity of the learning process. In domain adaptation, a solution learned on 
a source task is transferred to a target task and its performance depends on how similar the two tasks 
are. In [2] and [10] different distance measures are proposed and are shown to be connected to the
performance of the transferred solution. The case of transfer of samples from multiple source tasks
is studied in [3]. The most interesting finding is that the transfer performance benefits from using a
larger training set at the cost of an additional error due to the average distance between source and 
target tasks. This implies the existence of a transfer tradeoff between transferring as many samples 
as possible and limiting the transfer to sources which are similar to the target task. As a result, the 
transfer of samples is expected to outperform single-task learning whenever negative transfer (i.e.,
transfer from source tasks far from the target task) is limited w.r.t. the advantage of increasing the
size of the training set. This also raises the question of whether it is possible to design methods able to
automatically detect the similarity between tasks and adapt the transfer process accordingly. 

In this paper, we investigate the transfer of samples in RL from a more theoretical perspective w.r.t. 
previous works. The main contributions of this paper can be summarized as follows: 

• Algorithmic contribution. We introduce three sample-transfer algorithms based on fitted
Q-iteration [4]. The first algorithm (AST in Section 3) simply transfers all the source samples.
We also design two adaptive methods (BAT and BTT in Sections 4 and 5) whose objective
is to solve the transfer tradeoff by identifying the best combination of source tasks.

• Theoretical contribution. We formalize the setting of transfer of samples and we derive a 
finite-sample analysis of AST which highlights the importance of the average MDP ob- 
tained by the combination of the source tasks. We also report the analysis for BAT which 
shows both the advantage of identifying the best combination of source tasks and the addi- 
tional cost in terms of auxiliary samples needed to compute the similarity between tasks. 

• Empirical contribution. We report results (in Section 6) on a simple chain problem which
confirm the main theoretical findings and support the idea that sample transfer can signifi- 
cantly speed-up the learning process and that adaptive methods are able to solve the transfer 
tradeoff and avoid negative transfer effects. 

The rest of the paper is organized as follows. In Section 2 we introduce the notation and we define
the transfer problem. Section 3 reports the theoretical analysis of AST. BAT is described in Section 4
along with its theoretical analysis. A more challenging setting is introduced in Section 5 together
with BTT. Section 6 reports the experimental results and Section 7 concludes the paper. Finally, in
the appendix we report the proofs and some additional experimental analysis.

2 Preliminaries 

In this section we introduce the notation and the transfer problem considered in the rest of the paper. 

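Notation for MDPs. We define a discounted Markov decision process (MDP) as a tuple
$\mathcal{M} = \langle \mathcal{X}, \mathcal{A}, \mathcal{R}, \mathcal{P}, \gamma \rangle$ where the state space $\mathcal{X}$ is a bounded closed subset of the
Euclidean space, $\mathcal{A}$ is a finite ($|\mathcal{A}| < \infty$) action space, the deterministic$^1$ reward function
$\mathcal{R} : \mathcal{X} \times \mathcal{A} \to \mathbb{R}$ is uniformly bounded by $R_{\max}$, the transition kernel $\mathcal{P}$ is such that for all
$x \in \mathcal{X}$ and $a \in \mathcal{A}$, $\mathcal{P}(\cdot|x,a)$ is a distribution over $\mathcal{X}$, and $\gamma \in (0,1)$ is a discount factor.
We denote by $\mathcal{S}(\mathcal{X} \times \mathcal{A})$ the set of probability measures over $\mathcal{X} \times \mathcal{A}$ and by
$\mathcal{B}(\mathcal{X} \times \mathcal{A}; V_{\max} = \frac{R_{\max}}{1-\gamma})$ the space of bounded measurable functions with domain
$\mathcal{X} \times \mathcal{A}$ and bounded in $[-V_{\max}, V_{\max}]$. We define the optimal action-value function $Q^*$ as the
unique fixed-point of the optimal Bellman operator
$\mathcal{T} : \mathcal{B}(\mathcal{X} \times \mathcal{A}; V_{\max}) \to \mathcal{B}(\mathcal{X} \times \mathcal{A}; V_{\max})$ defined by

$$(\mathcal{T}Q)(x,a) = \mathcal{R}(x,a) + \gamma \int_{\mathcal{X}} \max_{a' \in \mathcal{A}} Q(y,a')\, \mathcal{P}(dy|x,a).$$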

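Notation for function spaces. For any measure $\mu \in \mathcal{S}(\mathcal{X} \times \mathcal{A})$ obtained from the combination
of a distribution $\rho \in \mathcal{S}(\mathcal{X})$ and a uniform distribution over the discrete set $\mathcal{A}$, and a measurable
function $f : \mathcal{X} \times \mathcal{A} \to \mathbb{R}$, we define the $L_2(\mu)$-norm of $f$ as
$\|f\|_\mu^2 = \sum_{a \in \mathcal{A}} \int_{\mathcal{X}} f(x,a)^2 \rho(dx)$.
The supremum norm of $f$ is defined as $\|f\|_\infty = \sup_{x \in \mathcal{X}} |f(x)|$. Finally, we define the standard
L2-norm for a vector $\alpha \in \mathbb{R}^d$ as $\|\alpha\|^2 = \sum_{i=1}^d \alpha_i^2$. We denote by
$\varphi(\cdot,\cdot) = (\varphi_1(\cdot,\cdot), \ldots, \varphi_d(\cdot,\cdot))^\top$ a feature vector with features
$\varphi_i : \mathcal{X} \times \mathcal{A} \to [-C, C]$, and by $\mathcal{F} = \{f_\alpha(\cdot,\cdot) = \varphi(\cdot,\cdot)^\top \alpha\}$ the linear
space of action-value functions spanned by the basis functions in $\varphi$. Given a set of state-action pairs
$\{(X_l, A_l)\}_{l=1}^L$, let $\Phi = [\varphi(X_1, A_1)^\top; \ldots; \varphi(X_L, A_L)^\top]$ be the corresponding feature matrix. We
define the orthogonal projection operator $\Pi : \mathcal{B}(\mathcal{X} \times \mathcal{A}; V_{\max}) \to \mathcal{F}$ as
$\Pi Q = \arg\min_{f \in \mathcal{F}} \|Q - f\|_\mu$. Finally, by $T(Q)$ we denote the truncation of a function $Q$ in the range
$[-V_{\max}, V_{\max}]$.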

1 The extension to stochastic reward functions is straightforward. 
Input: Linear space $\mathcal{F} = \mathrm{span}\{\varphi_i, 1 \le i \le d\}$, initial function $\tilde{Q}^0 \in \mathcal{F}$
for $k = 1, 2, \ldots$ do
    Build the training set $\{(X_l, A_l, Y_l, R_l)\}_{l=1}^L$ [according to random tasks design]
    Build the feature matrix $\Phi = [\varphi(X_1, A_1)^\top; \ldots; \varphi(X_L, A_L)^\top]$
    Compute the vector $p \in \mathbb{R}^L$ with $p_l = R_l + \gamma \max_{a' \in \mathcal{A}} \tilde{Q}^{k-1}(Y_l, a')$
    Compute the projection $\hat{\alpha}^k = (\Phi^\top \Phi)^{-1} \Phi^\top p$ and the function $\hat{Q}^k = f_{\hat{\alpha}^k}$
    Return the truncated function $\tilde{Q}^k = T(\hat{Q}^k)$
end for



Figure 1: A pseudo-code for All-Sample Transfer (AST) Fitted Q-iteration. 



Problem setup. We consider the transfer problem in which $M$ tasks $\{\mathcal{M}_m\}_{m=1}^M$ are available and
the objective is to learn the solution for the target task $\mathcal{M}_1$ transferring samples from the source
tasks $\{\mathcal{M}_m\}_{m=2}^M$. We define an assumption on how the training sets are generated.

Definition 1. (Random Tasks Design) An input set $\{(X_l, A_l)\}_{l=1}^L$ is built with samples drawn from
an arbitrary sampling distribution $\mu \in \mathcal{S}(\mathcal{X} \times \mathcal{A})$, i.e. $(X_l, A_l) \sim \mu$. For each task $m$, one
transition and reward sample is generated in each of the state-action pairs in the input set, i.e.
$Y_l^m \sim \mathcal{P}_m(\cdot|X_l, A_l)$, and $R_l^m = \mathcal{R}_m(X_l, A_l)$. Finally, we define the random sequence $\{M_l\}_{l=1}^L$ where
the indexes $M_l$ are drawn i.i.d. from a multinomial distribution with parameters $(\lambda_1, \ldots, \lambda_M)$. The
training set available to the learner is $\{(X_l, A_l, Y_l, R_l)\}_{l=1}^L$ where $Y_l = Y_l^{M_l}$ and $R_l = R_l^{M_l}$.

This is an assumption on how the samples are generated but in practice, a single realization of
samples and task indexes $M_l$ is available. We consider the case in which $\lambda_1 \ll \lambda_m$ ($m = 2, \ldots, M$).
This condition implies that (on average) the number of target samples is much smaller than the number
of source samples and it is usually not enough to learn an accurate solution for the target task. We will also
consider the pure transfer case in which $\lambda_1 = 0$ (i.e., no target sample is available). Finally, we
notice that Def. 1 implies the existence of a generative model for all the MDPs, since the state-
action pairs are generated according to an arbitrary sampling distribution $\mu$.
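The random tasks design can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `sample_pair`, `transition` and `reward` are hypothetical callables standing in for the sampling distribution and the generative models of the tasks, and index 0 plays the role of the target task. The sketch draws the task index first and queries only the selected task, which yields the same distribution for the final training set as generating one sample per task and then selecting one.

```python
import random


def random_tasks_design(n_samples, lambdas, sample_pair, transition, reward, rng):
    """Build a training set following the random tasks design (Def. 1).

    lambdas[m] is the proportion of task m (index 0 plays the role of the target).
    sample_pair, transition and reward are hypothetical callables standing in
    for the sampling distribution and the generative models of the tasks.
    """
    task_ids = list(range(len(lambdas)))
    training_set = []
    for _ in range(n_samples):
        x, a = sample_pair(rng)                        # state-action pair drawn from mu
        m = rng.choices(task_ids, weights=lambdas)[0]  # task index drawn with proportions lambda
        y = transition(m, x, a, rng)                   # next state generated by task m
        r = reward(m, x, a)                            # reward generated by task m
        training_set.append((x, a, y, r, m))
    return training_set
```

Setting `lambdas[0] = 0` reproduces the pure transfer case, in which no target sample enters the training set.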



3 All-Sample Transfer Algorithm

We first consider the case when the source samples are generated beforehand according to Def. 1
and the designer has no access to the source tasks. We study the algorithm called All-Sample
Transfer (AST) (Fig. 1) which simply runs FQI with a linear space $\mathcal{F}$ on the whole training set
$\{(X_l, A_l, Y_l, R_l)\}_{l=1}^L$. At each iteration $k$, given the result of the previous iteration
$\tilde{Q}^{k-1} = T(\hat{Q}^{k-1})$, the algorithm returns

$$\hat{Q}^k = \arg\min_{f \in \mathcal{F}} \sum_{l=1}^L \Big( f(X_l, A_l) - \big(R_l + \gamma \max_{a' \in \mathcal{A}} \tilde{Q}^{k-1}(Y_l, a')\big) \Big)^2. \quad (1)$$

In the case of linear spaces, the minimization problem is solved in closed form as in Fig. 1. In the
following we report a finite-sample analysis of the performance of AST. Similar to [11], we first
study the prediction error in each iteration and we then propagate it through iterations.
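A single AST iteration with linear features can be sketched as follows. This is a minimal illustration, not the authors' implementation: it uses only the Python standard library, solves the normal equations by Gaussian elimination, and assumes that `features`, `q_prev` and the sample format `(x, a, y, r)` are supplied by the caller (all of these names are hypothetical).

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system: features are linearly dependent")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x


def ast_iteration(samples, features, actions, q_prev, gamma, v_max):
    """One FQI iteration on the pooled (source and target) training set.

    samples: list of (x, a, y, r); features(x, a) returns a list of d floats;
    q_prev(x, a) is the truncated function from the previous iteration.
    Returns the weights alpha and the new function truncated to [-v_max, v_max].
    """
    phi = [features(x, a) for x, a, _, _ in samples]
    targets = [r + gamma * max(q_prev(y, b) for b in actions) for _, _, y, r in samples]
    d = len(phi[0])
    # Normal equations: (Phi^T Phi) alpha = Phi^T p.
    gram = [[sum(row[i] * row[j] for row in phi) for j in range(d)] for i in range(d)]
    rhs = [sum(row[i] * t for row, t in zip(phi, targets)) for i in range(d)]
    alpha = solve_linear_system(gram, rhs)

    def q_new(x, a):
        value = sum(w * f for w, f in zip(alpha, features(x, a)))
        return max(-v_max, min(v_max, value))

    return alpha, q_new
```

The explicit solve raises an error when the Gram matrix is singular, which mirrors the need for linearly independent features in the analysis that follows.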



3.1 Single Iteration Finite-Sample Analysis 

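We define the average MDP $\mathcal{M}_\lambda$ as the average of the $M$ MDPs at hand. We define its reward
function $\mathcal{R}_\lambda$ and its transition kernel $\mathcal{P}_\lambda$ as the weighted average of reward functions and transition
kernels of the basic MDPs with weights determined by the proportions $\lambda$ of the multinomial distribution
in the definition of the random tasks design (i.e., $\mathcal{R}_\lambda = \sum_{m=1}^M \lambda_m \mathcal{R}_m$, $\mathcal{P}_\lambda = \sum_{m=1}^M \lambda_m \mathcal{P}_m$).
The resulting average Bellman operator is

$$(\mathcal{T}_\lambda Q)(x,a) = \Big(\sum_{m=1}^M \lambda_m \mathcal{T}_m Q\Big)(x,a) = \mathcal{R}_\lambda(x,a) + \gamma \int_{\mathcal{X}} \max_{a'} Q(y,a')\, \mathcal{P}_\lambda(dy|x,a). \quad (2)$$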
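The identity between the two forms of the average Bellman operator can be checked numerically on small finite MDPs. The sketch below is an illustration under the assumption of finite state and action spaces (the paper works with continuous states); `bellman_backup` and `average_mdp` are hypothetical helper names.

```python
def bellman_backup(reward, transition, q, gamma):
    """Apply the optimal Bellman operator of one finite MDP.

    reward[x][a] is a float, transition[x][a][y] a probability, q[y][a] a float.
    """
    n_states, n_actions = len(reward), len(reward[0])
    return [
        [
            reward[x][a]
            + gamma * sum(transition[x][a][y] * max(q[y]) for y in range(n_states))
            for a in range(n_actions)
        ]
        for x in range(n_states)
    ]


def average_mdp(rewards, transitions, lambdas):
    """Build the reward and transition model of the average MDP."""
    n_states, n_actions = len(rewards[0]), len(rewards[0][0])
    r_avg = [
        [sum(l * r[x][a] for l, r in zip(lambdas, rewards)) for a in range(n_actions)]
        for x in range(n_states)
    ]
    p_avg = [
        [
            [sum(l * p[x][a][y] for l, p in zip(lambdas, transitions)) for y in range(n_states)]
            for a in range(n_actions)
        ]
        for x in range(n_states)
    ]
    return r_avg, p_avg
```

Applying `bellman_backup` to the averaged model gives the same values as the weighted sum of the per-task backups, because the term inside the expectation (the maximum of Q over actions) does not depend on the task.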
In the random tasks design, the average MDP plays a crucial role since the implicit target function
of the minimization of the empirical loss in Eq. 1 is indeed $\mathcal{T}_\lambda \tilde{Q}^{k-1}$. At each iteration $k$, we prove
the following performance bound for AST.

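Theorem 1. Let $M$ be the number of tasks $\{\mathcal{M}_m\}_{m=1}^M$, with $\mathcal{M}_1$ the target task. Let the training
set $\{(X_l, A_l, Y_l, R_l)\}_{l=1}^L$ be generated as in Def. 1, with a proportion vector $\lambda = (\lambda_1, \ldots, \lambda_M)$. Let
$f_{\alpha_*^k} = \Pi \mathcal{T}_1 \tilde{Q}^{k-1} = \arg\inf_{f \in \mathcal{F}} \|f - \mathcal{T}_1 \tilde{Q}^{k-1}\|_\mu$, then for any $0 < \delta \le 1$, $\hat{Q}^k$ (Eq. 1) satisfies

$$\|T(\hat{Q}^k) - \mathcal{T}_1 \tilde{Q}^{k-1}\|_\mu \le 4\|f_{\alpha_*^k} - \mathcal{T}_1 \tilde{Q}^{k-1}\|_\mu + 5\sqrt{\mathcal{E}_\lambda(\tilde{Q}^{k-1})}$$
$$\qquad + 24\big(V_{\max} + C\|\alpha_*^k\|\big)\sqrt{\frac{2}{L}\log\frac{9}{\delta}} + 32 V_{\max}\sqrt{\frac{2}{L}\log\Big(\frac{27(12Le^2)^{2(d+1)}}{\delta}\Big)},$$

with probability $1 - \delta$ (w.r.t. samples), where $\|\varphi_i\|_\infty \le C$ and
$\mathcal{E}_\lambda(\tilde{Q}^{k-1}) = \|(\mathcal{T}_1 - \mathcal{T}_\lambda)\tilde{Q}^{k-1}\|_\mu^2$.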

Remark 1 (Analysis of the bound). We first notice that the previous bound reduces (up to constants)
to the standard bound for FQI when $M = 1$ (see Section B). The bound is composed of three main
terms: (i) approximation error, (ii) estimation error, and (iii) transfer error. The approximation
error $\|f_{\alpha_*^k} - \mathcal{T}_1 \tilde{Q}^{k-1}\|_\mu$ is the smallest error of functions in $\mathcal{F}$ in approximating the target function
$\mathcal{T}_1 \tilde{Q}^{k-1}$ and it is independent of the transfer algorithm. The estimation error (third and fourth
terms in the bound) is due to the finite random samples used to learn $\hat{Q}^k$; it depends on the
dimensionality $d$ of the function space and it decreases with the total number of samples $L$ with the
fast rate of linear spaces ($O(d/L)$ instead of $O(\sqrt{d/L})$). Finally, the transfer error $\mathcal{E}_\lambda$ accounts for
the difference between source and target tasks. In fact, samples from source tasks different from the
target might bias $\hat{Q}^k$ towards a wrong solution, thus resulting in a poor approximation of the target
function $\mathcal{T}_1 \tilde{Q}^{k-1}$. It is interesting to notice that the transfer error depends on the difference between
the target task and the average MDP $\mathcal{M}_\lambda$ obtained by taking a linear combination of the source tasks
weighted by the parameters $\lambda$. This means that even when each of the source tasks is very different
from the target, if there exists a suitable combination which is similar to the target task, then the
transfer process is still likely to be effective. Furthermore, $\mathcal{E}_\lambda$ considers the difference in the result
of the application of the two Bellman operators to a given function $\tilde{Q}^{k-1}$. As a result, when the two
operators $\mathcal{T}_1$ and $\mathcal{T}_\lambda$ have the same reward functions, even if the transition distributions are different
(e.g., the total variation $\|\mathcal{P}_1(\cdot|x,a) - \mathcal{P}_\lambda(\cdot|x,a)\|_{TV}$ is large), their corresponding averages of $\tilde{Q}^{k-1}$
might still be similar (i.e., $\int \max_{a'} Q(y,a')\mathcal{P}_1(dy|x,a)$ similar to $\int \max_{a'} Q(y,a')\mathcal{P}_\lambda(dy|x,a)$).

Remark 2 (Comparison to single-task learning). Let $\hat{Q}_s^k$ be the solution obtained by solving one
iteration of FQI with only samples from the target task. The performance bounds of $\hat{Q}_s^k$ and $\hat{Q}^k$ can
be written as (up to constants and logarithmic factors)

$$\|T(\hat{Q}_s^k) - \mathcal{T}_1 \tilde{Q}^{k-1}\|_\mu \le \|f_{\alpha_*^k} - \mathcal{T}_1 \tilde{Q}^{k-1}\|_\mu + \big(V_{\max} + C\|\alpha_*^k\|\big)\sqrt{\frac{1}{N_1}} + V_{\max}\sqrt{\frac{d}{N_1}},$$
$$\|T(\hat{Q}^k) - \mathcal{T}_1 \tilde{Q}^{k-1}\|_\mu \le \|f_{\alpha_*^k} - \mathcal{T}_1 \tilde{Q}^{k-1}\|_\mu + \big(V_{\max} + C\|\alpha_*^k\|\big)\sqrt{\frac{1}{L}} + V_{\max}\sqrt{\frac{d}{L}} + \sqrt{\mathcal{E}_\lambda(\tilde{Q}^{k-1})},$$

with $N_1 = \lambda_1 L$ (on average). Both bounds share exactly the same approximation error. The main
difference is that $\hat{Q}_s^k$ uses only $N_1$ samples and, as a result, has a much bigger estimation error than
$\hat{Q}^k$, which takes advantage of all the $L$ samples transferred from the source tasks. At the same time,
$\hat{Q}^k$ suffers from an additional transfer error which does not exist in the case of $\hat{Q}_s^k$. Thus, we can
conclude that AST is expected to perform better than single-task learning whenever the advantage
of using more samples is greater than the bias due to samples coming from tasks different from the
target task. This introduces a transfer tradeoff between including many source samples, so as to
reduce the estimation error, and finding source tasks whose combination leads to a small transfer
error. In Section 4 we show how it is possible to define an adaptive transfer algorithm which selects
proportions $\lambda$ so as to keep the transfer error $\mathcal{E}_\lambda$ as small as possible. Finally, in Section 5 we
consider a different setting where the maximum number of samples in each source is fixed.
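The transfer tradeoff discussed in this remark can be illustrated numerically. The sketch below assumes that, up to constants and logarithmic factors, both bounds have the form approximation error plus $(V_{\max} + C\|\alpha\|)\sqrt{1/n} + V_{\max}\sqrt{d/n}$, with an extra square root of the transfer error for AST. The function names and the argument `c_alpha` (standing for $C\|\alpha\|$) are hypothetical, and the numbers are not taken from the paper.

```python
import math


def single_task_bound(approx, n_target, d, v_max, c_alpha):
    """Single-task bound shape: only the target samples reduce the estimation error."""
    return (
        approx
        + (v_max + c_alpha) * math.sqrt(1.0 / n_target)
        + v_max * math.sqrt(d / n_target)
    )


def ast_bound(approx, n_total, d, v_max, c_alpha, transfer_error):
    """AST bound shape: all samples reduce the estimation error, at the price of a transfer term."""
    return (
        approx
        + (v_max + c_alpha) * math.sqrt(1.0 / n_total)
        + v_max * math.sqrt(d / n_total)
        + math.sqrt(transfer_error)
    )
```

Comparing the two functions for a fixed number of target samples shows the tradeoff: the AST value is smaller when the transfer error is small relative to the gain from the larger sample size, and larger otherwise.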
3.2 Propagation Finite-Sample Analysis

We now study how the previous error is propagated through iterations. Let $\nu$ be the evaluation norm
(i.e., in general different from the sampling distribution $\mu$). We first report two assumptions.$^2$

Assumption 1. [11] Given $\mu$, $\nu$, $p \ge 1$, and an arbitrary sequence of policies $\{\pi_p\}_{p \ge 1}$, we assume
that the future-state distribution $\mu \mathcal{P}^1_{\pi_1} \cdots \mathcal{P}^1_{\pi_p}$ is absolutely continuous w.r.t. $\nu$. We assume that
$c(p) = \sup_{\pi_1 \ldots \pi_p} \|d(\mu \mathcal{P}^1_{\pi_1} \cdots \mathcal{P}^1_{\pi_p})/d\nu\|_\infty$ satisfies
$C_{\mu,\nu} = (1-\gamma)^2 \sum_{p \ge 1} p\,\gamma^{p-1} c(p) < \infty$.

We also need the features $\varphi_i$ to be linearly independent w.r.t. $\mu$.

Assumption 2. Let $G \in \mathbb{R}^{d \times d}$ be the Gram matrix with $[G]_{ij} = \int \varphi_i(x,a)\varphi_j(x,a)\mu(dx,a)$. We
assume that its smallest eigenvalue $\omega$ is strictly positive (i.e., $\omega > 0$).

Using the two previous assumptions we derive the following performance bound for AST. 
Theorem 2. Let Assumptions 1 and 2 hold and the setting be as in Theorem 1. After $K$ iterations,
AST returns an action-value function $\tilde{Q}^K$, whose corresponding greedy policy $\pi_K$ satisfies
HQ* - Q? K \\v < j^^/jV^



4 sup inf ||/ -71ff|U + 5sup ||(71 -TA)T(/ a )|| 



56(K iax H ^)W-log— +32KiaxW T log ' i 



Remark (Analysis of the bound). The bound reported in the previous theorem displays a few
differences w.r.t. the single-iteration bound. The additional term $O(\gamma^K)$ accounts for the error
due to the finite number of iterations of FQI and it decreases exponentially with base $\gamma$. The
approximation error is now $\sup_{g \in \mathcal{F}} \inf_{f \in \mathcal{F}} \|f - \mathcal{T}_1 g\|_\mu$. This term is referred to as the inherent Bellman
error [11] of the space $\mathcal{F}$ and it is related to how well the Bellman images of functions in $\mathcal{F}$ can
be approximated by $\mathcal{F}$ itself. It is possible to show that for particular classes of MDPs (e.g.,
Lipschitz), if a large enough number of carefully designed features is available, then this term is small.
In the estimation error, the norm $\|\alpha_*^k\|$ is bounded using the linear independence between features
(Assumption 2) and the boundedness of the functions $\tilde{Q}^k$ returned at each iteration. The resulting
term has an inverse dependency on the smallest eigenvalue $\omega$ which tends to be small whenever the
Gram matrix is not well-conditioned (i.e., the features are almost linearly dependent). The transfer
error $\sup_\alpha \|(\mathcal{T}_1 - \mathcal{T}_\lambda)T(f_\alpha)\|_\mu$ characterizes the difference between the target and average Bellman
operators through the space $\mathcal{F}$. As a result, even MDPs with significantly different rewards and
transitions might have a small transfer error because of the functions in $\mathcal{F}$. This introduces a tradeoff
in the design of $\mathcal{F}$ between a "large" enough space containing functions able to approximate $\mathcal{T}_1 Q$
(i.e., small approximation error) and a small function space where the Q-functions induced by $\mathcal{T}_1$
and $\mathcal{T}_\lambda$ can be closer (i.e., small transfer error). This term also displays interesting similarities with
the notion of discrepancy introduced in [10] in domain adaptation.



4 Best Average Transfer Algorithm 

As discussed in the previous section, the transfer error $\mathcal{E}_\lambda$ plays a crucial role in the comparison with
single-task learning. In particular, $\mathcal{E}_\lambda$ is related to the proportions $\lambda$ inducing the average Bellman
operator $\mathcal{T}_\lambda$ which defines the target function approximated at each iteration. We now consider
the case where the designer has direct access to the source tasks (i.e., it is possible to choose how
many samples to draw from each source) and can define an arbitrary proportion $\lambda$. In particular, we
propose a method that adapts $\lambda$ at each iteration so as to minimize the transfer error $\mathcal{E}_\lambda$.

We consider the case in which $L$ is fixed as a parameter of the algorithm and $\lambda_1 = 0$ (i.e.,
no target samples are used in the learning training set). At each iteration $k$, we need to estimate
the quantity $\mathcal{E}_\lambda(\tilde{Q}^{k-1})$. We assume that additional samples are available for each task. Let
$\{(X_s, A_s, R_{s,1}, \ldots, R_{s,M})\}_{s=1}^S$ be an auxiliary training set where $(X_s, A_s) \sim \mu$ and
$R_{s,m} = \mathcal{R}_m(X_s, A_s)$. In each state-action pair, we generate $T$ next states for each task, that is
$Y^t_{s,m} \sim \mathcal{P}_m(\cdot|X_s, A_s)$ with $t = 1, \ldots, T$. Thus, for any function $Q$ we define the estimated transfer error as

$$\hat{\mathcal{E}}_\lambda(Q) = \frac{1}{S}\sum_{s=1}^S \Big[ R_{s,1} - \sum_{m=2}^M \lambda_m R_{s,m} + \frac{\gamma}{T}\sum_{t=1}^T \Big( \max_{a'} Q(Y^t_{s,1}, a') - \sum_{m=2}^M \lambda_m \max_{a'} Q(Y^t_{s,m}, a') \Big) \Big]^2. \quad (3)$$

2 We refer to [11] for a thorough explanation of the concentrability terms.

Input: Space $\mathcal{F} = \mathrm{span}\{\varphi_i, 1 \le i \le d\}$, initial function $\tilde{Q}^0 \in \mathcal{F}$, number of samples $L$
Build the auxiliary set $\{(X_s, A_s, R_{s,1}, \ldots, R_{s,M})\}_{s=1}^S$ and $\{Y^t_{s,1}, \ldots, Y^t_{s,M}\}_{t=1}^T$ for each $s$
for $k = 1, 2, \ldots$ do
    Compute $\hat{\lambda}^k = \arg\min_{\lambda \in \Lambda} \hat{\mathcal{E}}_\lambda(\tilde{Q}^{k-1})$
    Run one iteration of AST (Fig. 1) using $L$ samples generated according to $\hat{\lambda}^k$
end for

Figure 2: A pseudo-code for the Best Average Transfer (BAT) algorithm.



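At each iteration, the algorithm Best Average Transfer (BAT) (Fig. 2) first computes
$\hat{\lambda}^k = \arg\min_{\lambda \in \Lambda} \hat{\mathcal{E}}_\lambda(\tilde{Q}^{k-1})$, where $\Lambda$ is the $(M-2)$-dimensional simplex, and then runs an iteration of
AST with samples generated according to the proportions $\hat{\lambda}^k$. We denote by $\lambda_*^k$ the best
combination at iteration $k$, that is

$$\lambda_*^k = \arg\min_{\lambda \in \Lambda} \mathcal{E}_\lambda(\tilde{Q}^{k-1}) = \arg\min_{\lambda \in \Lambda} \mathbb{E}_{(x,a) \sim \mu}\Big[\Big(\sum_{m=2}^M \lambda_m (\mathcal{T}_m \tilde{Q}^{k-1})(x,a) - (\mathcal{T}_1 \tilde{Q}^{k-1})(x,a)\Big)^2\Big]. \quad (4)$$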

The following performance guarantee can be proved for BAT. 

Lemma 1. Let $\{(X_s, A_s, R_{s,1}, \ldots, R_{s,M})\}_{s=1}^S$ be a training set where $(X_s, A_s) \sim \mu$ and
$R_{s,m} = \mathcal{R}_m(X_s, A_s)$, and for each state-action pair and for each task $m$, $T$ next states
$Y^t_{s,m} \sim \mathcal{P}_m(\cdot|X_s, A_s)$ with $t = 1, \ldots, T$ are available. For any fixed bounded function
$Q \in \mathcal{B}(\mathcal{X} \times \mathcal{A}; V_{\max})$, the $\hat{\lambda}$ returned by minimizing Eq. 3 is such that

$$\mathcal{E}_{\hat{\lambda}}(Q) - \mathcal{E}_{\lambda_*}(Q) \le 2V_{\max}\sqrt{\frac{(M-2)\log(4S/\delta)}{S}} + 16V_{\max}^2\,\frac{\log(4SM/\delta)}{T} \quad (5)$$

with probability $1 - \delta$.



From the previous lemma the approximation performance of BAT at each iteration follows. 
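Theorem 3. Let $\tilde{Q}^{k-1}$ be the function returned at the previous iteration and $\hat{Q}^k_{BAT}$ the function
returned by the BAT algorithm (Fig. 2). Then for any $0 < \delta \le 1$, $\hat{Q}^k_{BAT}$ satisfies

$$\|T(\hat{Q}^k_{BAT}) - \mathcal{T}_1 \tilde{Q}^{k-1}\|_\mu \le 4\|f_{\alpha_*^k} - \mathcal{T}_1 \tilde{Q}^{k-1}\|_\mu + 5\sqrt{\mathcal{E}_{\lambda_*^k}(\tilde{Q}^{k-1})}$$
$$\qquad + 5\sqrt{2V_{\max}}\Big(\frac{(M-2)\log(8S/\delta)}{S}\Big)^{1/4} + 20V_{\max}\sqrt{\frac{\log(8SM/\delta)}{T}}$$
$$\qquad + 24\big(V_{\max} + C\|\alpha_*^k\|\big)\sqrt{\frac{2}{L}\log\frac{18}{\delta}} + 32V_{\max}\sqrt{\frac{2}{L}\log\Big(\frac{54(12Le^2)^{2(d+1)}}{\delta}\Big)}$$

with probability $1 - \delta$.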

Remark 1 (Comparison with AST and single-task learning). The analysis of the bound shows
that BAT outperforms AST whenever the advantage in achieving the smallest possible transfer error
$\mathcal{E}_{\lambda_*^k}$ is larger than the additional estimation error due to the auxiliary training set. It is also interesting
to compare BAT to single-task learning. In fact, BAT performs better than single-task learning
whenever the best possible combination of source tasks has a small transfer error and the additional
estimation error related to the auxiliary training set is smaller than the estimation error in single-
task learning. In particular, this means that $O((M/S)^{1/4}) + O((1/T)^{1/2})$ should be smaller than
$O((d/N)^{1/2})$ (with $N$ the number of target samples). The number of calls to the generative model
for BAT is $ST$. In order to have a fair comparison with single-task learning we set $S = N^{2/3}$ and
$T = N^{1/3}$; requiring $(M/S)^{1/4} < (d/N)^{1/2}$ then gives the condition $M < d^2 N^{-4/3}$ that constrains the
number of tasks to be smaller than the dimensionality of the function space $\mathcal{F}$. We remark that the dependency of the
auxiliary estimation error on $M$ is due to the fact that the $\lambda$ vectors (over which the transfer error is
optimized) belong to the simplex $\Lambda$ of dimensionality $M-2$. Hence, the previous condition suggests
that, in general, adaptive transfer methods may significantly improve the transfer performance (i.e.,
in this case a smaller transfer error) at the cost of additional sources of errors which depend on the
dimensionality of the search space used to adapt the transfer process (i.e., in this case $\Lambda$).

Remark 2 (Iterations). BAT recomputes the proportions $\hat{\lambda}^k$ at each iteration $k$. In fact a combination
$\lambda^1$ approximating well the reward function $\mathcal{R}_1$ at the first iteration (i.e., $\mathcal{R}_1 \approx \mathcal{R}_{\lambda^1}$) does not
necessarily have a small transfer error $\|(\mathcal{T}_1 - \mathcal{T}_{\lambda^1})Q^1\|_\mu$ at the second iteration. We further investigate
how the best source combination changes through iterations in the experiments of Section 6.

Remark 3 (Best source combination). The previous theorem shows that BAT achieves the smallest
transfer error $\mathcal{E}_{\lambda_*^k}(\tilde{Q}^{k-1})$ at the cost of an additional estimation error which scales with the size of the
auxiliary training set as $O((M/S)^{1/4}) + O((1/T)^{1/2})$. We notice that the first term of the estimation
error depends on how well the distribution $\mu$ is approximated by using a finite number $S$ of state-action pairs
and it has a slower rate w.r.t. the other terms. The second term depends on the number of next states
$T$ simulated at each state-action pair which are used to estimate the value of the Bellman operators.
As a result, in order to reduce the estimation error we need to increase both $S$ and the number of
next states $T$ in each state-action pair. It is interesting to notice that similar estimation errors appear
in FVI [11] where the optimal Bellman operator is approximated by Monte-Carlo estimation.

Remark 4 (Training set). The implicit assumption in the definition of the auxiliary training set is 
that it is possible to generate a series of next states and rewards for all the tasks at the same state- 
action pairs. If the source training sets are fixed in advance and the designer has no access to the 
source tasks, then this assumption does not hold and it is not possible to test the similarity between
the MDP $\mathcal{M}_\lambda$ and the target task. Nonetheless, if the generative model for the source tasks is available
at learning time, the auxiliary training set could be generated before the learning phase actually
begins. Furthermore, in the theoretical analysis, BAT does not use the samples in the auxiliary
training set at learning time. A trivial improvement is to include the auxiliary samples in the training
set.

Remark 5 (Comparison to other transfer methods). In [8] a method to compute the similarity
between MDPs is proposed. In particular, the authors introduce the definition of compliance as the
average probability of the target samples to be generated from a sample-based estimation of the
source MDPs. The compliance is later used to determine the proportion of samples to be transferred
from each of the source tasks. Although this algorithm shares a similar objective with BAT, the two use
different notions of similarity. In particular, the method in [8] tries to identify source tasks which
are individually similar to the target task, while the transfer error minimized in BAT considers the
average MDP obtained by the transfer process. Furthermore, the notion of compliance tries to
measure the overall distance between two MDPs, while $\mathcal{E}_\lambda(Q)$ always measures the distance of the
images of a function $Q$ through two different Bellman operators.

Remark 6 (Computational complexity). Finally, we notice that the minimization of $\hat{\mathcal{E}}_\lambda$ is a
convex quadratic problem since the objective function is convex in $\lambda$ and $\lambda$ belongs to the
$(M-2)$-dimensional simplex.
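Since the estimated transfer error is a convex quadratic function of the proportions over a simplex, any standard constrained solver applies. The following sketch uses the Frank-Wolfe method with exact line search, which is our choice for illustration and not a procedure described in the paper. It assumes the estimated Bellman backups of the current function have already been computed at each auxiliary state-action pair; `target[s]` and `sources[s][m]` are hypothetical names for those values.

```python
def estimated_backup(r, next_states, q, actions, gamma):
    """Monte Carlo estimate of a Bellman backup from one reward and T next states."""
    best_values = [max(q(y, b) for b in actions) for y in next_states]
    return r + gamma * sum(best_values) / len(next_states)


def transfer_error(lam, target, sources):
    """Mean squared gap between target backups and lambda-weighted source backups."""
    total = 0.0
    for t, row in zip(target, sources):
        total += (t - sum(l * b for l, b in zip(lam, row))) ** 2
    return total / len(target)


def minimize_on_simplex(target, sources, n_iters=200):
    """Frank-Wolfe with exact line search for the convex quadratic objective."""
    n_sources = len(sources[0])
    lam = [1.0 / n_sources] * n_sources
    for _ in range(n_iters):
        pred = [sum(l * b for l, b in zip(lam, row)) for row in sources]
        resid = [t - p for t, p in zip(target, pred)]
        # Gradient direction (the positive factor 2/S does not change the argmin).
        grad = [-sum(r * row[m] for r, row in zip(resid, sources)) for m in range(n_sources)]
        j = min(range(n_sources), key=lambda m: grad[m])
        # Moving towards vertex j changes the prediction at pair s by step * u[s].
        u = [row[j] - p for row, p in zip(sources, pred)]
        denom = sum(v * v for v in u)
        if denom == 0.0:
            break
        step = max(0.0, min(1.0, sum(r * v for r, v in zip(resid, u)) / denom))
        if step == 0.0:
            break  # zero Frank-Wolfe gap: lam is optimal
        lam = [l + step * ((1.0 if m == j else 0.0) - l) for m, l in enumerate(lam)]
    return lam
```

When one source produces exactly the same backups as the target, the procedure puts all the weight on that source and the estimated transfer error drops to zero.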



5 Best Tradeoff Transfer Algorithm

The previous algorithm is proved to successfully estimate the combination of source tasks which
better approximates the Bellman operator of the target task. Nonetheless, BAT relies on the implicit
assumption that $L$ samples can always be generated from any source task$^3$ and it cannot be applied
to the case where the number of source samples is limited. Here we consider the more challenging
case where the designer still has access to the source tasks but only a limited number of samples
is available in each of them. In this case, an adaptive transfer algorithm should solve a tradeoff
between selecting as many samples as possible, so as to reduce the estimation error, and choosing
the proportion of source samples properly, so as to control the transfer error. The solution of this
tradeoff may return non-trivial results, where source tasks similar to the target task but with few
samples are removed in favor of a pool of tasks whose average roughly approximates the target task
but can provide a larger number of samples.

Input: Linear space $\mathcal{F} = \mathrm{span}\{\varphi_i, 1 \le i \le d\}$, initial function $\tilde{Q}^0 \in \mathcal{F}$, maximum
number of samples available for each task $N_m$, transfer parameter $\tau$
Build a training set $\{(X_s, A_s, R_{s,1}, \ldots, R_{s,M})\}_{s=1}^S$ and the next states $\{Y^t_{s,1}, \ldots, Y^t_{s,M}\}_{t=1}^T$ for
each state-action pair
for $k = 1, 2, \ldots$ do
    Compute $\hat{\beta}^k = \arg\min_{\beta \in [0,1]^M} \hat{\mathcal{E}}_\beta(\tilde{Q}^{k-1}) + \tau\sqrt{d / \sum_{m=1}^M \beta_m N_m}$
    Run one iteration of AST (Fig. 1) using $L$ samples generated according to $\hat{\beta}^k$
end for

Figure 3: A pseudo-code for Best Tradeoff Transfer (BTT).

Here we introduce the Best Tradeoff Transfer (BTT) algorithm (see Figure 3). Similar to BAT, it
relies on an auxiliary training set to solve the tradeoff. We denote by $N_m$ the maximum number of
samples available for source task $m$. Let $\beta \in [0,1]^M$ be a weight vector, where $\beta_m$ is the fraction
of samples from task $m$ used in the transfer process. We denote by $\mathcal{E}_\beta$ ($\hat{\mathcal{E}}_\beta$) the transfer error
(the estimated transfer error) with proportions $\lambda$ where $\lambda_m = (\beta_m N_m) / \sum_{m'} (\beta_{m'} N_{m'})$. At each
iteration $k$, BTT returns the vector $\beta$ which optimizes the tradeoff between estimation and transfer
errors, that is

$$\hat{\beta}^k = \arg\min_{\beta \in [0,1]^M} \Big( \hat{\mathcal{E}}_\beta(\tilde{Q}^{k-1}) + \tau\sqrt{\frac{d}{\sum_{m=1}^M \beta_m N_m}} \Big), \quad (6)$$

where $\tau$ is a parameter. While the first term accounts for the transfer error induced by $\beta$, the second
term is the estimation error due to the total amount of samples used by the algorithm.
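The tradeoff optimized by BTT can be sketched with a brute-force search over a grid of weight vectors. This is an illustration only: the paper does not prescribe a solver, the grid search is viable only for a handful of tasks, and the estimation term is assumed here to be $\tau\sqrt{d/n}$ with $n$ the total number of selected samples. The names `target[s]` and `task_backups[s][m]` (estimated Bellman backups at the auxiliary pairs) are hypothetical.

```python
import itertools
import math


def btt_objective(beta, n_max, target, task_backups, d, tau):
    """Estimated transfer error of the induced proportions plus tau * sqrt(d / n_used)."""
    used = [b * n for b, n in zip(beta, n_max)]
    n_used = sum(used)
    lam = [u / n_used for u in used]  # proportions induced by beta
    err = sum(
        (t - sum(l * v for l, v in zip(lam, row))) ** 2
        for t, row in zip(target, task_backups)
    ) / len(target)
    return err + tau * math.sqrt(d / n_used)


def btt_grid_search(n_max, target, task_backups, d, tau, steps=10):
    """Exhaustive search over a regular grid of the unit hypercube (small M only)."""
    grid = [i / steps for i in range(steps + 1)]
    best_beta, best_value = None, float("inf")
    for beta in itertools.product(grid, repeat=len(n_max)):
        if sum(b * n for b, n in zip(beta, n_max)) == 0:
            continue  # no sample selected: the objective is undefined
        value = btt_objective(beta, n_max, target, task_backups, d, tau)
        if value < best_value:
            best_beta, best_value = beta, value
    return best_beta, best_value
```

With `tau = 0` the search keeps only the tasks that match the target, while a very large `tau` pushes it towards using every available sample, which is the behaviour the tradeoff is meant to balance.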

Unlike AST and BAT, BTT is a heuristic algorithm motivated by the performance bound in
Theorem 1 and we do not provide any theoretical guarantee about its performance. The main technical
difficulty w.r.t. the previous algorithms is that the setting considered here does not match the random
task design assumption (see Def. 1) since the number of source samples is constrained by $N_m$. As
a result, given a proportion vector $\lambda$, we cannot assume samples to be drawn at random according
to a multinomial of parameters $\lambda$. Without this assumption, it is an open question whether a similar
bound to AST and BAT could be derived. Nonetheless, the experimental results reported in Section 6
show the effectiveness of BTT in solving the transfer tradeoff.

6 Experiments 

In this section, we report and discuss preliminary experimental results of the transfer algorithms 
introduced in the previous sections. The main objective is to illustrate the functioning of the algo- 
rithms and compare their results with the theoretical findings. Thus, we focus on a simple problem 
and we leave more challenging problems for future work. 

We consider a continuous extension of the 50-state variant of the chain walk problem proposed in [6].
The state space is described by a continuous state variable x and two actions are available: one that 

3 If $\lambda_m = 1$ for task $m$, then the algorithm would generate all the $L$ training samples from task $m$.
Table 1: Parameters for the first set of tasks

tasks               p      l     η      Reward
$\mathcal{M}_1$     0.9    1     0.1    +1 in [-11, -9] ∪ [9, 11]
$\mathcal{M}_2$     0.9    2     0.1    -5 in [-11, -9] ∪ [9, 11]
$\mathcal{M}_3$     0.9    1     0.1    +5 in [-11, -9] ∪ [9, 11]
$\mathcal{M}_4$     0.9    1     0.1    +1 in [-6, -4] ∪ [4, 6]
$\mathcal{M}_5$     0.9    1     0.1    -1 in [-6, -4] ∪ [4, 6]

Table 2: Parameters for the second set of tasks

tasks               p      l     η      Reward
$\mathcal{M}_1$     0.9    1     0.1    +1 in [-11, -9] ∪ [9, 11]
$\mathcal{M}_6$     0.7    1     0.1    +1 in [-11, -9] ∪ [9, 11]
$\mathcal{M}_7$     0.1    1     0.1    +1 in [-11, -9] ∪ [9, 11]
$\mathcal{M}_8$     0.9    1     0.1    -5 in [-11, -9] ∪ [9, 11]
$\mathcal{M}_9$     0.7    1     0.5    +5 in [-11, -9] ∪ [9, 11]


Figure 4: Transfer from $\mathcal{M}_2$, $\mathcal{M}_3$, $\mathcal{M}_4$, $\mathcal{M}_5$. Left: Comparison between single-task learning, AST
with L = 10000, BAT with L = 1000, 5000, 10000. Right: Source task probabilities estimated by
BAT algorithm as a function of FQI iterations.



moves toward the left and the other toward the right. With probability $p$ each action makes a step of
length $l$, affected by a noise $\eta$, in the intended direction, while with probability $1 - p$ it moves in
the opposite direction. For the target task $\mathcal{M}_1$, the state-transition model is defined by the
following parameters: $p = 0.9$, $l = 1$, and $\eta$ is uniform in the interval $[-0.1, 0.1]$. The reward
function provides $+1$ when the system state reaches the regions $[-11, -9]$ and $[9, 11]$ and $0$ elsewhere.
Furthermore, to evaluate the performance of the transfer algorithms previously described, we considered
eight source tasks $\{\mathcal{M}_2, \ldots, \mathcal{M}_9\}$ whose state-transition model parameters and reward
functions are reported in Tabs. 1 and 2. To approximate the Q-functions, we use a linear combination of 20
radial basis functions. In particular, for each action, we consider 9 Gaussians with means uniformly spread
in the interval $[-20, 20]$ and variance equal to 16, plus a constant feature. The number of iterations for
the FQI algorithm has been empirically fixed to 13. Samples are collected through a sequence of
episodes, each one starting from the state $x_0 = 0$ with actions chosen uniformly at random. For all
the experiments, we average over 100 runs and we report standard deviation error bars.
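
The setup described above can be sketched in a few lines of Python. This is only an illustrative reading of the text, not the code used for the experiments: the function names, the episode length, the placement of the 9 Gaussian centers on an evenly spaced grid including the endpoints, the choice of computing the reward on the state reached, and the default start state `x0` are all assumptions made for the sketch.

```python
import math
import random

ACTIONS = (-1, +1)                                 # move left / move right
CENTERS = [-20.0 + 5.0 * i for i in range(9)]      # assumed grid: -20, -15, ..., 20
VARIANCE = 16.0


def step(x, action, p=0.9, length=1.0, noise=0.1, rng=random):
    """One transition: intended direction w.p. p, opposite direction otherwise."""
    direction = action if rng.random() < p else -action
    return x + direction * length + rng.uniform(-noise, noise)


def reward(x, value=1.0, lo=9.0, hi=11.0):
    """Reward `value` inside [-hi, -lo] U [lo, hi], zero elsewhere."""
    return value if lo <= abs(x) <= hi else 0.0


def features(x, action):
    """20 features: for each action, one constant plus 9 Gaussian bumps."""
    block = [1.0] + [math.exp(-(x - c) ** 2 / (2.0 * VARIANCE)) for c in CENTERS]
    zeros = [0.0] * len(block)
    return block + zeros if action == ACTIONS[0] else zeros + block


def collect_episode(n_steps, x0=0.0, rng=random, **task):
    """Random-policy episode; returns a list of (x, a, r, y) samples."""
    samples, x = [], x0
    for _ in range(n_steps):
        a = rng.choice(ACTIONS)
        y = step(x, a, rng=rng, **task)
        samples.append((x, a, reward(y), y))   # assumption: reward depends on the state reached
        x = y
    return samples
```

A source task with different dynamics is obtained by overriding the keyword arguments, for instance `collect_episode(50, p=0.7)`; a source task with a different reward would pass other `value`, `lo`, `hi` arguments to `reward`.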

We first consider the pure transfer problem where no target samples are actually used in the learning
training set (i.e., $\lambda_1 = 0$). The objective is to study the impact of the transfer error due to the use
of source samples and the effectiveness of BAT in finding a suitable combination of source tasks.
The left plot in Fig. 4 compares the performance of FQI with and without the transfer of samples
from the first four source tasks listed in Tab. 1. In the case of single-task learning, the number of target
samples refers to the samples used at learning time, while for BAT it represents the size $S$ of the auxiliary
training set used to estimate the transfer error. Thus, while in single-task learning the performance
increases with the target samples, in BAT they just make the estimation of $\mathcal{E}_\lambda$ more accurate. The
number of source samples added to the auxiliary set for each target sample was empirically fixed
to one ($T = 1$). We first run AST with $L = 10000$ and $\lambda_2 = \lambda_3 = \lambda_4 = \lambda_5 = 0.25$ (which
on average corresponds to 2500 samples from each source). As can be seen from the
models in Tab. 1, this combination is very different from the target model and AST does not learn
any good policy. On the other hand, even with a small set of auxiliary target samples, BAT is able to
learn good policies. This result is due to the existence of linear combinations of source tasks which
closely approximate the target task $\mathcal{M}_1$ at each iteration of FQI. An example of the proportion
coefficients computed at each iteration of BAT is shown in the right plot in Fig. 4. At the first
iteration, FQI produces an approximation of the reward function. Given the first four source tasks,
Figure 5: Transfer from $\mathcal{M}_6$, $\mathcal{M}_7$, $\mathcal{M}_8$, $\mathcal{M}_9$. Left: Comparison between single-task learning and
BAT with L = 1000, 5000, 10000. Right: Comparison between single-task learning, BAT with
L = 1000, 10000 in addition to the target samples, and BTT ($\tau = 0.75$) with 5000 and 10000
samples for each source task. To improve readability, the plot is truncated at 5000 target samples.



BAT finds a combination ($\lambda \approx (0.2, 0.4, 0.2, 0.2)$) that produces the same reward function as $\mathcal{R}_1$.
However, after a few FQI iterations, such a combination is no longer able to accurately approximate
the functions $\mathcal{T}_1 Q$. In fact, the state-transition model of task $\mathcal{M}_2$ is different from all the other ones
(the step length is doubled). As a result, the coefficient $\lambda_2$ drops to zero, while a new combination
among the other source tasks is found. Note that BAT significantly improves single-task learning, in
particular when very few target samples are available.
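
The claim that the proportions (0.2, 0.4, 0.2, 0.2) reproduce the target reward can be checked directly from the reward column of Table 1. The short sketch below does the arithmetic; the helper names are illustrative, and only the rewards (not the transition models) are compared.

```python
def region_reward(x, value, lo, hi):
    """Reward `value` when lo <= |x| <= hi, zero elsewhere."""
    return value if lo <= abs(x) <= hi else 0.0


# (value, lo, hi) for the source tasks M_2..M_5, read from Table 1.
SOURCES = {2: (-5.0, 9, 11), 3: (5.0, 9, 11), 4: (1.0, 4, 6), 5: (-1.0, 4, 6)}
# Proportions reported for the first FQI iteration of BAT.
LAMBDA = {2: 0.2, 3: 0.4, 4: 0.2, 5: 0.2}


def mixed_reward(x):
    """Average reward of the lambda-weighted combination of source tasks."""
    return sum(LAMBDA[m] * region_reward(x, *SOURCES[m]) for m in SOURCES)


def target_reward(x):
    """Reward of the target task M_1."""
    return region_reward(x, 1.0, 9, 11)
```

In the outer regions the mixture gives 0.2 · (-5) + 0.4 · 5 = 1, and in the inner regions it gives 0.2 · 1 + 0.2 · (-1) = 0, which matches the target reward everywhere. The transition models are a different matter, since $\mathcal{M}_2$ has a doubled step length.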

In the general case, the target task cannot be obtained as any combination of the source tasks, as
happens when considering the second set of source tasks ($\mathcal{M}_6$, $\mathcal{M}_7$, $\mathcal{M}_8$, $\mathcal{M}_9$). The impact of such a
situation on the learning performance of BAT is shown in the left plot in Fig. 5. Note that, when
a few target samples are available, the transfer of samples from a combination of the source tasks
using the BAT algorithm is still beneficial. On the other hand, the performance attainable by BAT is
bounded by the transfer error corresponding to the best source task combination (which in this case
is large). As a result, single-task FQI quickly achieves a better performance.

The results presented so far for the BAT transfer algorithm assume that FQI is trained only with the
samples obtained through combinations of source tasks. Since a number of target samples is already
available in the auxiliary training set, a trivial improvement is to include them in the training set
together with the source samples (selected according to the proportions computed by BAT). As
shown in the right plot of Fig. 5, this leads to a significant improvement. From the behavior
of BAT it is clear that with a small set of target samples, it is better to transfer as many samples as
possible from source tasks, while as the number of target samples increases, it is preferable to reduce
the number of samples obtained from a combination of source tasks that actually does not match the
target task. In fact, for L = 10000, BAT has a much better performance at the beginning but it is
then outperformed by single-task learning. On the other hand, for L = 1000 the initial advantage is
small but the performance remains close to single-task FQI for a large number of target samples. This
experiment highlights the tradeoff between the need for samples to reduce the estimation error and
the resulting transfer error when the target task cannot be expressed as a combination of source tasks
(see Section 5). The BTT algorithm provides a principled way to address this tradeoff and, as shown
by the right plot in Fig. 5, it exploits the advantage of transferring source samples when a few target
samples are available, and it reduces the weight of the source tasks (so as to avoid large transfer
errors) when samples from the target task are enough. It is interesting to notice that increasing the
number of samples available for each source task from 5000 to 10000 improves the performance
in the first part of the graph, while leaving the final performance unchanged. This is due to the
capability of the BTT algorithm to avoid the transfer of source samples when there is no need for
them, thus avoiding negative transfer effects.
7 Conclusions

In this paper, we formalized and studied the sample-transfer problem. We first derived a finite-sample
analysis of the performance of a simple transfer algorithm which includes all the source
samples into the training set used to solve a given target task. To the best of our knowledge, this
is the first theoretical result for a transfer algorithm in RL showing the potential benefit of transfer
over single-task learning. Then, in the case when the designer has direct access to the source tasks,
we introduced an adaptive algorithm which selects the proportion of source tasks so as to minimize
the bias due to the use of source samples. Finally, we considered a more challenging setting where
the number of samples available in each source task is limited and a tradeoff between the amount
of transferred samples and the similarity between source and target tasks must be solved. For this
setting, we proposed a principled adaptive algorithm. We also reported a detailed experimental
analysis on a simple problem which confirms and supports the theoretical findings.

This work opens several directions for future work. 

• Transfer with transformations. In many problems, there exist simple transformations to the 
source tasks dynamics and reward which would increase their similarity w.r.t. the target 
task, thus making the transfer process more effective. How affine transformations could be 
used in the adaptive transfer algorithms presented in this paper is an interesting direction 
for future work. In particular, it is an open question whether the cost (in terms of samples) 
of finding a suitable transformation would be effectively counter-balanced by transferring 
more similar samples. 

• Transfer between tasks with different state-action spaces. In many real applications source
and target tasks might have a different number of state variables and different actions. Thus,
the current work should be extended to the more general case of tasks with different
state-action spaces and it should be integrated with inter-task mapping transfer methods (see
[13]).

• Transfer with fixed tasks design. Definition 1 prescribes the process used to generate the
training set used in learning the target task. At each state-action pair, the sample is generated
from a source task chosen at random according to a multinomial distribution. When
the designer has no access to the source tasks and their samples are generated beforehand,
this generative model is not reasonable. A different model (fixed tasks design) should be
defined where each sample comes from a specific source which is fixed in advance. An
interesting direction for future work is to understand how this different generative model affects
the performance of the transfer algorithm and whether it is possible to define effective
adaptive algorithms for this case.
Input: Linear space $\mathcal{F} = \mathrm{span}\{\varphi_i, 1 \le i \le d\}$, initial function $\tilde{Q}^0$
for $k = 1, 2, \ldots$ do
    Draw training samples $\{(X_n, A_n, Y_n, R_n)\}_{n=1}^N$
    Build the feature matrix $\Phi = [\phi(X_1, A_1)^\top; \ldots; \phi(X_N, A_N)^\top]$
    Compute the vector $p_n = R_n + \gamma \max_{a' \in \mathcal{A}} \tilde{Q}^{k-1}(Y_n, a')$
    Compute the projection $\hat\alpha^k = (\Phi^\top \Phi)^{-1} \Phi^\top p$
    Return the truncated function $\tilde{Q}^k = T(f_{\hat\alpha^k})$
end for

Figure 6: Pseudo-code for Fitted Q-iteration.
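
As an illustration of the loop in Figure 6, the following is a minimal standard-library sketch of fitted Q-iteration with a linear space. It is not the authors' implementation. Three choices are assumptions of the sketch: the same sample set is reused at every iteration instead of being redrawn, the initial function is zero, and a tiny ridge term is added to the normal equations so that they stay solvable when the feature matrix is rank deficient. All function and parameter names are illustrative.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][j] * x[j] for j in range(r + 1, n))) / M[r][r]
    return x


def fitted_q_iteration(samples, phi, actions, gamma, v_max, n_iter, ridge=1e-8):
    """samples: list of (x, a, r, y); phi(x, a) returns a list of d features."""
    d = len(phi(samples[0][0], samples[0][1]))
    alpha = [0.0] * d                      # assumed initial function: Q^0 = 0

    def q_trunc(x, a):
        """Linear Q-value truncated to [-v_max, v_max]."""
        v = sum(w * f for w, f in zip(alpha, phi(x, a)))
        return max(-v_max, min(v_max, v))

    Phi = [phi(x, a) for x, a, _, _ in samples]
    # Normal-equation matrix Phi^T Phi (plus the assumed ridge term).
    gram = [[sum(row[i] * row[j] for row in Phi) + (ridge if i == j else 0.0)
             for j in range(d)] for i in range(d)]
    for _ in range(n_iter):
        # Regression targets p_n = R_n + gamma * max_a' Q^{k-1}(Y_n, a').
        p = [r + gamma * max(q_trunc(y, b) for b in actions) for _, _, r, y in samples]
        rhs = [sum(row[i] * t for row, t in zip(Phi, p)) for i in range(d)]
        alpha = solve(gram, rhs)           # least-squares projection step
    return alpha, q_trunc
```

On a two-state toy problem with one-hot features, where action 1 always pays reward 1 and leads to state 1, the returned `q_trunc` approaches the fixed point $Q(\cdot, 1) = 2$ for $\gamma = 0.5$, which is a quick way to sanity-check the loop.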



A Additional Notation 

Besides the notation introduced in Section 2, here we introduce additional symbols used in the
proofs. We define two empirical norms on functions and vectors. Given a set of $N$ state-action pairs
$\{(X_n, A_n)\}_{n=1}^N$ drawn i.i.d. from $\mu$, we define the empirical norm of a function $f$ as

$$\|f\|_{\hat\mu}^2 = \frac{1}{N} \sum_{n=1}^N f(X_n, A_n)^2.$$

Similarly, given a vector $y \in \mathbb{R}^N$ we define the empirical norm $\|y\|_N$ as

$$\|y\|_N^2 = \frac{1}{N} \sum_{n=1}^N y_n^2.$$

Given a set of $N$ state-action pairs $\{(X_n, A_n)\}_{n=1}^N$, let $\Phi = [\phi(X_1, A_1)^\top; \ldots; \phi(X_N, A_N)^\top]$ be
the feature matrix defined at the states $\{(X_n, A_n)\}_{n=1}^N$, and $\mathcal{F}_N = \{\Phi\alpha, \alpha \in \mathbb{R}^d\} \subset \mathbb{R}^N$ be the
corresponding vector space. We denote by $\hat\Pi : \mathbb{R}^N \to \mathcal{F}_N$ the empirical orthogonal projection onto
$\mathcal{F}_N$, defined by

$$\hat\Pi y = \arg\min_{z \in \mathcal{F}_N} \|y - z\|_N. \qquad (7)$$

Note that the orthogonal projection $\hat\Pi y$ of any $y \in \mathbb{R}^N$ always exists and is unique.
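
A small numerical example may help fix the notation. The sketch below computes the empirical projection of a vector onto the span of a two-column feature matrix by solving the 2x2 normal equations in closed form. It is restricted to two features with linearly independent columns purely to stay dependency-free, and the function names are illustrative.

```python
def emp_norm_sq(v):
    """Squared empirical norm: (1/N) * sum of squared components."""
    return sum(t * t for t in v) / len(v)


def project(Phi, y):
    """Least-squares projection of y onto the span of the two columns of Phi."""
    a = sum(r[0] * r[0] for r in Phi)
    b = sum(r[0] * r[1] for r in Phi)
    c = sum(r[1] * r[1] for r in Phi)
    u = sum(r[0] * t for r, t in zip(Phi, y))
    v = sum(r[1] * t for r, t in zip(Phi, y))
    det = a * c - b * b                    # nonzero when the columns are independent
    alpha = ((c * u - b * v) / det, (a * v - b * u) / det)
    return [alpha[0] * r[0] + alpha[1] * r[1] for r in Phi]
```

For `Phi = [[1, 0], [1, 1], [1, 2], [1, 3]]` and `y = [0, 1, 4, 9]`, the coefficients are (-1, 3), so the projection is `[-1, 2, 5, 8]` and the residual is `[1, -1, -1, 1]`. The squared empirical norms are 24.5 for `y`, 23.5 for the projection and 1 for the residual, which is the Pythagorean identity used in the fixed-design proof.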

B Fitted Q-iteration with Linear Spaces 

Although fitted iterative methods have already been analyzed in detail in [11] and [1], to the best
of our knowledge no explicit finite-sample bounds for FQI with linear spaces are available. Since at
each iteration FQI solves an explicit regression problem, the derivation is mostly a straightforward
application of regression bounds for linear spaces and quadratic loss. Here we just report the result
and the proof of the single iteration error for the so-called fixed and random samples design settings.

In Figure 6 we report the structure of the algorithm.

B.1 Fixed Samples Design

Similar to the analysis of LSTD in [7] we first derive the fixed design bound (i.e., the performance 
is evaluated exactly on the states in the training set). 

Theorem 4. Let $\mathcal{F} = \{\phi(\cdot,\cdot)^\top \alpha, \alpha \in \mathbb{R}^d\}$ be a $d$-dimensional linear space. Let
$\{(x_n, a_n, Y_n, R_n)\}_{n=1}^N$ be the training set where $\{(x_n, a_n)\}_{n=1}^N$ is an arbitrary sequence of
state-action pairs, $Y_n \sim \mathcal{P}(\cdot|x_n, a_n)$, and $R_n = \mathcal{R}(x_n, a_n)$. Given a function $Q \in \mathcal{B}(\mathcal{X} \times \mathcal{A}; V_{\max})$, let
$q \in \mathbb{R}^N$ be the vector whose components are $q_n = (\mathcal{T}Q)(x_n, a_n)$ and $\hat{q}$ be the solution of a single
iteration of fitted value iteration. Then with probability $1 - \delta$ (w.r.t. the random next states $Y_n$), $\hat{q}$
satisfies

$$\|q - \hat{q}\|_N \le \|\hat\Pi q - q\|_N + 4 V_{\max} \sqrt{\frac{2}{N} \log\left(\frac{3(9Ne^2)^{d+1}}{\delta}\right)}. \qquad (8)$$

Figure 7: The vectors used in the proof of Theorem 4, with $\hat{q} = \hat\Pi p$ and $u = \hat\Pi q$.



Proof. We denote by $u \in \mathbb{R}^N$ the orthogonal projection of the target vector $q$ onto the vector space
$\mathcal{F}_N$, that is $u = \hat\Pi q$. By the definition of orthogonal projection and the Pythagorean theorem we
decompose the error $\|q - \hat{q}\|_N$ as

$$\|q - \hat{q}\|_N^2 = \|\hat{q} - u\|_N^2 + \|u - q\|_N^2, \qquad (9)$$

where the first term represents the estimation error and the second term is the approximation error
(see Fig. 7). We denote by $\xi_n = p_n - q_n$ the noise in the observations $p$ w.r.t. $q$. It is easy to notice
that

$$\mathbb{E}[\xi_n] = \mathbb{E}_{Y \sim \mathcal{P}(\cdot|x_n, a_n)}\Big[\mathcal{R}(x_n, a_n) + \gamma \max_{a'} Q(Y, a')\Big] - (\mathcal{T}Q)(x_n, a_n) = 0, \qquad (10)$$

and that $|\xi_n| \le 2V_{\max}$. We also define the projected noise $\hat\xi_n = \hat{q}_n - u_n$, that is $\hat\xi = \hat\Pi \xi$. Thus, we
can rewrite the estimation error as

$$\|\hat{q} - u\|_N^2 = \|\hat\xi\|_N^2 = \langle \hat\xi, \hat\xi \rangle = \langle \xi, \hat\xi \rangle, \qquad (11)$$

where the last equality follows from the fact that $\hat\xi$ is the orthogonal projection of $\xi$. Since $\hat\xi \in \mathcal{F}_N$,
let $f_\beta \in \mathcal{F}$ be any function such that $f_\beta(x_n, a_n) = \hat\xi_n$. By a straightforward application of a
variation of Pollard's inequality [5] we obtain

$$\langle \xi, \hat\xi \rangle = \frac{1}{N} \sum_{n=1}^N \xi_n f_\beta(x_n, a_n) \le 4 V_{\max} \left(\frac{1}{N} \sum_{n=1}^N f_\beta(x_n, a_n)^2\right)^{1/2} \sqrt{\frac{2}{N} \log\left(\frac{3(9Ne^2)^{d+1}}{\delta}\right)}$$
$$= 4 V_{\max} \|\hat\xi\|_N \sqrt{\frac{2}{N} \log\left(\frac{3(9Ne^2)^{d+1}}{\delta}\right)}, \qquad (12)$$

with probability $1 - \delta$. Thus from equation 11 we bound the estimation error by

$$\|\hat{q} - u\|_N \le 4 V_{\max} \sqrt{\frac{2}{N} \log\left(\frac{3(9Ne^2)^{d+1}}{\delta}\right)}. \qquad (13)$$

Putting together the estimation error bound and the approximation error term, the statement of the
theorem follows. □



B.2 Random Samples Design 

While in the previous section we analyzed the performance of FQI on the very same state-action 
pairs in the training set, we now focus on the generalization (i.e., prediction) performance on the 
whole state-action space. 

Let $\hat{Q}$ be any function $f_{\hat\alpha} \in \mathcal{F}$ satisfying $\Phi\hat\alpha = \hat{q}$, where $\hat{q}$ is the vector defined in the previous
section. Then we derive the following theorem.
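Theorem 5. Let $\mathcal{F} = \{\phi(\cdot,\cdot)^\top \alpha, \alpha \in \mathbb{R}^d\}$ be a $d$-dimensional linear space. Let
$\{(X_n, A_n, Y_n, R_n)\}_{n=1}^N$ be the training set where $(X_n, A_n) \sim \mu$, $Y_n \sim \mathcal{P}(\cdot|X_n, A_n)$, and
$R_n = \mathcal{R}(X_n, A_n)$. Given a function $Q \in \mathcal{B}(\mathcal{X} \times \mathcal{A}; V_{\max})$, let $\hat{Q}$ be the solution of a single
iteration of fitted value iteration. Then with probability $1 - \delta$ (w.r.t. the samples and the next states),
$\hat{Q}$ satisfies

$$\|T(\hat{Q}) - \mathcal{T}Q\|_\mu \le 4 \inf_{f_\alpha \in \mathcal{F}} \|f_\alpha - \mathcal{T}Q\|_\mu + 24 (V_{\max} + L\|\alpha^*\|) \sqrt{\frac{2}{N} \log\frac{9}{\delta}}$$
$$+ 32 V_{\max} \sqrt{\frac{2}{N} \log\left(\frac{27(12Ne^2)^{2(d+1)}}{\delta}\right)}. \qquad (14)$$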

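Proof. The proof mainly relies on the application of concentration of measures inequalities for linear
spaces to the deterministic design bound in Theorem 4.

Let $f_{\hat\alpha^*} \in \mathcal{F}$ be any function such that $f_{\hat\alpha^*}(X_n, A_n) = (\hat\Pi q)_n$, thus the approximation error $\|\hat\Pi q - q\|_N$
can be rewritten as $\|f_{\hat\alpha^*} - \mathcal{T}Q\|_{\hat\mu}$. Furthermore we denote by $f_{\alpha^*} = \Pi(\mathcal{T}Q)$ the best
approximation of the target function $\mathcal{T}Q$ in $\mathcal{F}$ w.r.t. the distribution $\mu$. Since $f_{\hat\alpha^*}$ is the minimizer
of the empirical squared error, any function in $\mathcal{F}$ different from $f_{\hat\alpha^*}$ has a bigger empirical loss, thus
we obtain

$$\|f_{\hat\alpha^*} - \mathcal{T}Q\|_{\hat\mu} \le \|f_{\alpha^*} - \mathcal{T}Q\|_{\hat\mu} \le 2\|f_{\alpha^*} - \mathcal{T}Q\|_\mu + 12 (V_{\max} + L\|\alpha^*\|) \sqrt{\frac{2}{N} \log\frac{3}{\delta'}}, \qquad (15)$$

with probability $1 - \delta'$, where the second inequality is an application of a variation of Theorem 11.2
in [3] with a bound $\|f_{\alpha^*} - \mathcal{T}Q\|_\infty \le V_{\max} + L\|\alpha^*\|$. Similarly, we notice that the left hand side of
Eq. 8 is $\|q - \hat{q}\|_N = \|\hat{Q} - \mathcal{T}Q\|_{\hat\mu}$ and we obtain

$$2\|\hat{Q} - \mathcal{T}Q\|_{\hat\mu} \ge 2\|T(\hat{Q}) - \mathcal{T}Q\|_{\hat\mu} \ge \|T(\hat{Q}) - \mathcal{T}Q\|_\mu - 24 V_{\max} \sqrt{\frac{2}{N} \log\left(\frac{9(12eN)^{2(d+1)}}{\delta'}\right)}, \qquad (16)$$

with probability $1 - \delta'$, where the second inequality is an application of a variation of Theorem 11.2
in [3]. Putting together Eqs. 8, 15 and 16 we obtain

$$\|T(\hat{Q}) - \mathcal{T}Q\|_\mu \le 2\left(2\|f_{\alpha^*} - \mathcal{T}Q\|_\mu + 12 (V_{\max} + L\|\alpha^*\|) \sqrt{\frac{2}{N} \log\frac{3}{\delta'}} + 4 V_{\max} \sqrt{\frac{2}{N} \log\left(\frac{3(9Ne^2)^{d+1}}{\delta'}\right)}\right)$$
$$+ 24 V_{\max} \sqrt{\frac{2}{N} \log\left(\frac{9(12eN)^{2(d+1)}}{\delta'}\right)}.$$

Finally, by setting $\delta = 3\delta'$ the statement follows. □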



C Analysis of AST 
C.1 Proof of Theorem 1

Proof. Since the proof follows similar steps as in the proof of Theorem 5, we discuss here only
the fixed samples design bound. We define the vector $p \in \mathbb{R}^L$ such that for any $l = 1, \ldots, L$,
$p_l = \sum_{m=1}^M \mathbb{I}\{M_l = m\} \big(R_l^m + \gamma \max_{a'} Q(Y_l^m, a')\big)$. The target vector $q \in \mathbb{R}^L$ is the image of the
function $Q$ through the average optimal Bellman operator. In fact, by defining $q_l = (\mathcal{T}_\lambda Q)(X_l, A_l)$
we obtain a zero-mean noise vector $\xi_l = p_l - q_l$ such that $\mathbb{E}[\xi_l] = 0$ and $|\xi_l| \le 2V_{\max}$.^4

4 The expectation is taken w.r.t. the random realization of the reward $R_l^m$, the next state $Y_l^m$, and the task
index $M_l$.

The statement of the theorem simply follows by decomposing the prediction error of $\hat{Q}$ as

$$\|T(\hat{Q}) - \mathcal{T}_1 Q\|_\mu \le \|T(\hat{Q}) - \mathcal{T}_\lambda Q\|_\mu + \|\mathcal{T}_\lambda Q - \mathcal{T}_1 Q\|_\mu. \qquad (17)$$

By substituting $\|T(\hat{Q}) - \mathcal{T}_\lambda Q\|_\mu$ with a FQI bound w.r.t. the target function $\mathcal{T}_\lambda Q$ we obtain

$$\|T(\hat{Q}) - \mathcal{T}_1 Q\|_\mu \le 4\|f_\alpha - \mathcal{T}_\lambda Q\|_\mu + \|\mathcal{T}_\lambda Q - \mathcal{T}_1 Q\|_\mu \qquad (18)$$
$$+ 24 (V_{\max} + L\|\alpha\|) \sqrt{\frac{2}{L} \log\frac{9}{\delta}} + 32 V_{\max} \sqrt{\frac{2}{L} \log\left(\frac{27(12Le^2)^{2(d+1)}}{\delta}\right)}. \qquad (19)$$

By rewriting the approximation error as $\|f_\alpha - \mathcal{T}_\lambda Q\|_\mu \le \|f_\alpha - \mathcal{T}_1 Q\|_\mu + \|\mathcal{T}_1 Q - \mathcal{T}_\lambda Q\|_\mu$ and
using $\alpha = \alpha^*$ the final bound follows.

□

C.2 Proof of Theorem 2

Proof. [Sketch] The main structure of the proof is exactly the same as in [11]. The main differences
are due to the use of linear spaces and the transfer error. Following the passages in the proof of
Theorem 2 in [11], we obtain



$$\|Q^* - Q^{\pi_K}\|_\mu \le \frac{\cdots}{(1 - \gamma)^{3/2}} \Big[\, \cdots \,\Big].$$



Thus, we need to study all the terms in the statement of Theorem 1 affected by the maximization
over the iterations.

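Approximation error. The approximation term becomes

$$\max_k \min_{f \in \mathcal{F}} \|f - \mathcal{T}_1 \tilde{Q}^k\|_\mu \le \sup_\alpha \min_{f \in \mathcal{F}} \|f - \mathcal{T}_1 T(f_\alpha)\|_\mu.$$

This term is referred to as the inherent Bellman error of the space $\mathcal{F}$ and it is related to how well the
Bellman images of functions in $\mathcal{F}$ can be approximated by $\mathcal{F}$ itself.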

Estimation error. The second relevant term is the term $\|\alpha^k\|$ appearing in the estimation error. We
recall that $f_{\alpha^k} = \Pi \mathcal{T}_1 \tilde{Q}^{k-1}$ is the projection on $\mathcal{F}$ of the Bellman image of the function returned
at the previous iteration. The function $\tilde{Q}^{k-1}$ is truncated in the interval $[-V_{\max}, V_{\max}]$ and its
Bellman image $\mathcal{T}_1 \tilde{Q}^{k-1}$ is still bounded in the same interval. Since the projection operator $\Pi$ is a
non-expansion, we finally have that $\|f_{\alpha^k}\|_\infty \le V_{\max}$. Using Assumption 2, for any $f_\alpha \in \mathcal{F}$, it is
possible to relate the norm of the function to the norm of the vector $\alpha$ as

$$\|f_\alpha\|_\mu^2 = \|\phi^\top \alpha\|_\mu^2 = \alpha^\top G \alpha \ge \omega\, \alpha^\top \alpha = \omega \|\alpha\|^2.$$

By combining the bound on $\alpha$ with the bound on $f_\alpha$, we obtain that

$$\max_k \|\alpha^k\| \le \max_k \frac{\|f_{\alpha^k}\|_\mu}{\sqrt{\omega}} \le \frac{V_{\max}}{\sqrt{\omega}}.$$
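
The step from the function norm to the coefficient norm can be checked numerically on a toy example. The sketch below is illustrative only: it uses two features, replaces the distribution $\mu$ by a uniform distribution over a handful of points, and takes $\omega$ to be the smallest eigenvalue of the resulting Gram matrix. All names are hypothetical.

```python
import math


def gram(points, phi):
    """Empirical Gram matrix G = (1/N) sum_n phi(x_n) phi(x_n)^T for 2 features."""
    n = len(points)
    g = [[0.0, 0.0], [0.0, 0.0]]
    for x in points:
        f = phi(x)
        for i in range(2):
            for j in range(2):
                g[i][j] += f[i] * f[j] / n
    return g


def smallest_eigenvalue(g):
    """Closed form for a symmetric 2x2 matrix."""
    half_trace = 0.5 * (g[0][0] + g[1][1])
    det = g[0][0] * g[1][1] - g[0][1] * g[1][0]
    return half_trace - math.sqrt(max(half_trace ** 2 - det, 0.0))


def mu_norm(alpha, points, phi):
    """L2 norm of f_alpha = phi^T alpha under the uniform distribution on points."""
    sq = sum(sum(a * f for a, f in zip(alpha, phi(x))) ** 2 for x in points)
    return math.sqrt(sq / len(points))
```

With `phi = lambda x: [1.0, x]` and points `[-1, 0, 1, 2]`, the Gram matrix is `[[1, 0.5], [0.5, 1.5]]`, its smallest eigenvalue is strictly positive, and every coefficient vector satisfies `norm(alpha) <= mu_norm(alpha) / sqrt(omega)`. The smaller $\omega$ is (that is, the closer the features are to being linearly dependent under $\mu$), the weaker the resulting bound on $\|\alpha\|$.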

Transfer error. Since $\tilde{Q}^k$ is the truncation of a function $f_{\hat\alpha^k} = \hat{Q}^k$ belonging to $\mathcal{F}$, the transfer error
is

$$\max_k \|(\mathcal{T}_1 - \mathcal{T}_\lambda)\tilde{Q}^k\|_\mu \le \sup_\alpha \|(\mathcal{T}_1 - \mathcal{T}_\lambda) T(f_\alpha)\|_\mu.$$

Finally, the statement of the theorem follows by taking a union bound over $K$ iterations.

□
D Analysis of BAT

D.1 Proof of Theorem 3

Lemma 2. Let $\{(X_s, A_s, R_s^1, \ldots, R_s^M)\}_{s=1}^S$ be a training set where $(X_s, A_s) \sim \mu$ and
$R_s^m = \mathcal{R}_m(X_s, A_s)$, and for each state-action pair and for each task $m$, $T$ next states $Y_{s,t}^m \sim \mathcal{P}_m(\cdot|X_s, A_s)$
with $t = 1, \ldots, T$ are available. For any fixed bounded function $Q \in \mathcal{B}(\mathcal{X} \times \mathcal{A}; V_{\max})$, the $\hat\lambda$
returned by minimizing Eq. 3 is such that

$$\mathcal{E}_{\hat\lambda}(Q) - \mathcal{E}_{\lambda^*}(Q) \le 2 V_{\max} \sqrt{\frac{(M-2) \log(4S/\delta)}{S}} + 16 V_{\max}^2 \sqrt{\frac{2 \log(4SM/\delta)}{T}}, \qquad (20)$$

with probability $1 - \delta$.



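Proof. [Lemma 2]

The sketch of the proof is as follows. For any state-action pair $(X_s, A_s)$, we define

$$\hat{\mathcal{E}}_\lambda(X_s, A_s) = R_s^1 - \sum_{m=2}^M \lambda_m R_s^m + \frac{1}{T} \sum_{t=1}^T \Big( \gamma \max_{a'} Q(Y_{s,t}^1, a') - \gamma \sum_{m=2}^M \lambda_m \max_{a'} Q(Y_{s,t}^m, a') \Big)$$

and

$$\mathcal{E}_\lambda(X_s, A_s) = (\mathcal{T}_1 Q)(X_s, A_s) - \sum_{m=2}^M \lambda_m (\mathcal{T}_m Q)(X_s, A_s).$$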



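As a result, $\mathcal{E}_\lambda = \mathbb{E}_{(x,a) \sim \mu}[\mathcal{E}_\lambda(x, a)^2]$ and $\hat{\mathcal{E}}_\lambda = \frac{1}{S} \sum_{s=1}^S \hat{\mathcal{E}}_\lambda(X_s, A_s)^2$. By Pollard's inequality on the
$(M-2)$-dimensional simplex $\Lambda$, we have for any $\lambda \in \Lambda$

$$\Big| \mathbb{E}_\mu[\mathcal{E}_\lambda(x, a)^2] - \frac{1}{S} \sum_{s=1}^S \mathcal{E}_\lambda(X_s, A_s)^2 \Big| \le V_{\max} \sqrt{\frac{(M-2) \log(S/\delta')}{S}}, \qquad (21)$$

with probability $1 - \delta'$. Using the Chernoff-Hoeffding inequality we now bound the distance between
the true Bellman operators in $\mathcal{E}_\lambda(X_s, A_s)$ and their estimates in $\hat{\mathcal{E}}_\lambda(X_s, A_s)$. By the triangle inequality
and the previous definitions, we obtain the following series of inequalities

$$\Big| \frac{1}{S} \sum_{s=1}^S \hat{\mathcal{E}}_\lambda(X_s, A_s)^2 - \frac{1}{S} \sum_{s=1}^S \mathcal{E}_\lambda(X_s, A_s)^2 \Big| \le \cdots \le 8 V_{\max}^2 \sqrt{\frac{2 \log(SM/\delta')}{T}}. \qquad (22)$$

By using Eqs. 21 and 22 we have for any $\lambda \in \Lambda$

$$|\mathcal{E}_\lambda - \hat{\mathcal{E}}_\lambda| \le V_{\max} \sqrt{\frac{(M-2) \log(S/\delta')}{S}} + 8 V_{\max}^2 \sqrt{\frac{2 \log(SM/\delta')}{T}},$$

with probability $1 - 2\delta'$. Finally, we can prove the following sequence of inequalities

$$\mathcal{E}_{\hat\lambda} - \mathcal{E}_{\lambda^*} = \mathcal{E}_{\hat\lambda} - \hat{\mathcal{E}}_{\hat\lambda} + \hat{\mathcal{E}}_{\hat\lambda} - \hat{\mathcal{E}}_{\lambda^*} + \hat{\mathcal{E}}_{\lambda^*} - \mathcal{E}_{\lambda^*}$$
$$\le 2 \sup_{\lambda \in \Lambda} |\mathcal{E}_\lambda - \hat{\mathcal{E}}_\lambda| \le 2 V_{\max} \sqrt{\frac{(M-2) \log(S/\delta')}{S}} + 16 V_{\max}^2 \sqrt{\frac{2 \log(SM/\delta')}{T}},$$

with probability $1 - 4\delta'$, where the middle difference is non-positive because $\hat\lambda$ minimizes $\hat{\mathcal{E}}_\lambda$.
By setting $\delta = 4\delta'$ the statement follows. □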
Figure 8: Percentage of samples for each task as selected by BTT as a function of FQI iterations
when 5000 samples are available for each source task. Left: 100 target samples available. Right:
10000 target samples available.



E Additional Experimental Analysis 

In this section, we provide additional experimental results related to the BTT algorithm. 
E.1 Analysis of parameters $\beta$

In order to better understand how BTT trades off between the need for samples and
the risk of introducing a large transfer error, in Figure 8 we show the values of the parameters $\beta$ (which
represent the percentage of samples transferred from each task) as optimized by the BTT algorithm
at each FQI iteration. The tasks considered are the target task $\mathcal{M}_1$ and the source tasks
$\mathcal{M}_6, \mathcal{M}_7, \mathcal{M}_8, \mathcal{M}_9$, each one with 5000 samples available. Figure 8 compares the values of $\beta$ in
two scenarios: when the available target samples are 100 (left pane) and 10000 (right pane).
BTT always exploits all the target samples ($\beta_1 = 1$). When few target samples are available,
BTT transfers high percentages of samples from the source tasks. In particular, it transfers all the
samples available from task $\mathcal{M}_9$ in each iteration, and the percentage of samples taken from
task $\mathcal{M}_8$ is also almost constant (about 0.7). The percentage of samples transferred from tasks $\mathcal{M}_6$ and
$\mathcal{M}_7$ starts from 100% and decreases (at different rates) through the iterations, reaching zero after
iteration 10. This behavior can be explained by the attempt to include as many samples as possible
at the earlier iterations, when it is still possible to find combinations of sources with a small transfer
error. As the iterations continue, no suitable combination of sources is possible and the algorithm
is forced to reduce the number of samples from the most dissimilar source tasks. On the other hand,
when the number of target samples is large enough, we notice that the percentage of samples
transferred from all the source tasks drops after the first FQI iterations. In fact, in this case, BTT
exploits many source samples to produce more accurate approximations only when a very small
error is introduced. As the iterations progress, the samples from the source tasks
(even when optimally combined) provide a poor approximation of the Q-functions and, as a result,
BTT, given the large number of target samples (10000), prefers to reduce the number of samples
transferred from the source tasks.

In Figure 9 we show the proportions $\lambda$ induced by the weights $\beta$ computed by BTT. When only 100
target samples are available, BTT tries to compensate for the lack of target samples by transferring a
large number of samples from a suitable combination of source tasks, while, when many target
samples are available, it considers source samples only when they can guarantee a good approximation
of the target Q-functions; otherwise the proportions are changed in favor of the target samples.

Finally, in Figure 10 we consider the total number of samples used to train FQI at each iteration
under the two scenarios. As expected, at the first iterations, due to the similarity between source
tasks and target task, the number of samples provided to FQI by BTT is very large and then it
decreases through the iterations. It is interesting to notice that the total numbers of samples selected in
the two scenarios are quite similar (in particular starting from the third iteration), which is an effect
of the tradeoff realized by the BTT algorithm.
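
The relation between the three quantities plotted in this appendix (percentages, proportions, and total sample counts) can be illustrated with a few lines of code. The normalization used below, where each proportion is the task's share of the total number of selected samples, is an assumption of this sketch based on the description in the text; the function name and the example weights are hypothetical.

```python
def btt_mixture(beta, available):
    """Map per-task fractions beta_m to sample counts and mixture proportions.

    beta[m]      fraction of the samples of task m that are used (in [0, 1])
    available[m] number of samples available for task m
    """
    counts = [b * n for b, n in zip(beta, available)]
    total = sum(counts)
    proportions = [c / total for c in counts]   # assumed: share of the selected samples
    return counts, proportions
```

For example, with 100 target samples, 5000 samples per source task and weights `[1.0, 0.5, 0.0, 0.7, 1.0]` (made-up values for the target followed by four sources), the selected counts are 100, 2500, 0, 3500 and 5000, for a total of 11100 samples, and the target accounts for less than 1% of the training set.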
Figure 9: Proportions in the combination of task samples induced by the BTT $\beta$ parameters as a function
of FQI iterations. Left: 100 target samples available. Right: 10000 target samples available.




Figure 10: Number of samples actually used at each iteration (after the transfer) by FQI. Left: 100 
target samples available. Right: 10000 target samples available. 



E.2 Analysis of parameter $\tau$

The tradeoff realized by the BTT algorithm is tuned by the parameter $\tau$ multiplying the estimation
error. In Figure 11 we analyze the effect of $\tau$ on the learning performance. Different values of
the tradeoff parameter have been tried ($\tau = 0.25, 0.50, 0.75, 1.0$) when both 5000 samples (left
pane) and 10000 samples (right pane) are available for each source task. As we can notice, BTT is
quite robust w.r.t. the choice of the tradeoff parameter. The main differences appear when a small
number of target samples is available. In this case, low values of $\tau$ make BTT more concerned
about the transfer error and, as a result, it tends to avoid transferring source samples, even if target
samples are not enough. On the other hand, with high values of $\tau$, BTT is pushed to use more source
samples, and this may negatively affect the performance when several target samples are available and
no combination of source tasks provides a good target approximation.
Figure 11: Comparison of the performance of FQI using the BTT algorithm with different values of the
tradeoff parameter $\tau$. Left: 5000 samples available for each source. Right: 10000 samples available
for each source.